
    EOLE: Toward a Practical Implementation of Value Prediction

    A new architecture, Early/Out-of-Order/Late Execution (EOLE), leverages value prediction to execute a significant number of instructions outside the out-of-order engine. This approach reduces the issue width, which is a major contributor to both out-of-order engine complexity and the register file port requirement. This reduction paves the way for a truly practical implementation of value prediction.

    Rebasing Microarchitectural Research with Industry Traces

    Microarchitecture research relies on performance models with various degrees of accuracy and speed. In the past few years, one such model, ChampSim, has started to gain significant traction by coupling ease of use with a reasonable level of detail and simulation speed. At the same time, datacenter-class workloads, which are not trivial to set up and benchmark, have become easier to study via the release of hundreds of industry traces following the first Championship Value Prediction (CVP-1) in 2018. A tool was quickly created to port the CVP-1 traces to the ChampSim format, and the converted traces have, as a result, been used in many recent works. In this paper, we revisit this conversion tool and find that several key aspects of the CVP-1 traces are not preserved by the conversion. We therefore propose an improved converter that addresses most conversion issues and patches known limitations of the CVP-1 traces themselves. We evaluate the impact of our changes on two commits of ChampSim, one of which was used for the first Instruction Prefetching Championship (IPC-1) in 2020. We find that the performance variation stemming from higher-accuracy conversion is significant.
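    As a point of reference for why such a conversion can be lossy, consider ChampSim's on-disk trace record. The layout below is reproduced from memory of IPC-1-era ChampSim sources, so treat the exact definition as an assumption; the key observation is that it has no field for produced register values, which CVP-1 traces do carry and which value-prediction studies need.

```cpp
#include <cstdint>

// ChampSim trace record (input_instr) as commonly defined around the
// IPC-1-era commits; layout reproduced from memory, treat as an assumption.
constexpr int NUM_INSTR_DESTINATIONS = 2;
constexpr int NUM_INSTR_SOURCES = 4;

struct input_instr {
    uint64_t ip;                                            // instruction pointer
    uint8_t  is_branch;                                     // 1 if branch
    uint8_t  branch_taken;                                  // 1 if taken
    uint8_t  destination_registers[NUM_INSTR_DESTINATIONS]; // output registers
    uint8_t  source_registers[NUM_INSTR_SOURCES];           // input registers
    uint64_t destination_memory[NUM_INSTR_DESTINATIONS];    // store addresses
    uint64_t source_memory[NUM_INSTR_SOURCES];              // load addresses
    // Note: no field for the produced value; CVP-1 records include result
    // values, so a converter to this format must drop or re-derive them.
};
```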

    EOLE: Paving the Way for an Effective Implementation of Value Prediction

    Published at the International Symposium on Computer Architecture (ISCA) 2014. Link: http://people.irisa.fr/Arthur.Perais/data/ISCA%2714_EOLE.pdf
    Even in the multicore era, there is a continuous demand to increase the performance of single-threaded applications. However, the conventional path of increasing both issue width and instruction window size inevitably leads to the power wall. Value prediction (VP) was proposed in the mid-'90s as an alternative path to further enhance the performance of wide-issue superscalar processors. Still, until recently, a performance-effective implementation of VP was considered to add tremendous complexity and power consumption to almost every stage of the pipeline. Recent work in the field has shown, however, that given an efficient confidence estimation mechanism, prediction validation can be removed from the out-of-order engine and delayed until commit time. As a result, recovering from mispredictions via selective replay can be avoided and a much simpler mechanism, pipeline squashing, can be used, while the out-of-order engine remains mostly unmodified. Nonetheless, VP with validation at commit time entails strong constraints on the physical register file: write ports are needed to write predicted results and read ports are needed to validate them at commit time, potentially rendering the overall number of ports impractical. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place and in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until just before commit, since predictions are validated at commit time. Consequently, a significant number of instructions (10% to 60% in our experiments) can bypass the out-of-order engine, allowing a reduction of the issue width, which is a major contributor to both out-of-order engine complexity and register file port requirement. This reduction paves the way for a truly practical implementation of value prediction. Furthermore, since value prediction in itself usually increases performance, our resulting {Early | Out-of-Order | Late} Execution (EOLE) architecture is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.
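    To make the {Early | Out-of-Order | Late} split concrete, here is a minimal sketch of what dispatch-time routing could look like. Field and function names are illustrative assumptions, not the paper's exact microarchitecture.

```cpp
#include <cstdio>

// Illustrative instruction descriptor; fields are assumptions for the sketch.
struct Uop {
    bool single_cycle_alu;   // simple ALU operation (add, shift, logical...)
    bool operands_predicted; // all source operands available in the front-end
    bool result_predicted;   // own result predicted with high confidence
};

enum class Route { Early, OutOfOrder, Late };

// EOLE-style routing: execute simple ops with known operands in-order in the
// front-end (Early), defer predicted simple ops until just before commit,
// where validation happens anyway (Late), and send everything else to the
// out-of-order engine.
Route route(const Uop& u) {
    if (u.single_cycle_alu && u.operands_predicted) return Route::Early;
    if (u.single_cycle_alu && u.result_predicted)   return Route::Late;
    return Route::OutOfOrder;
}

int main() {
    Uop add_known{true, true, false};
    Uop add_pred{true, false, true};
    Uop load{false, false, false};
    std::printf("%d %d %d\n", (int)route(add_known), (int)route(add_pred),
                (int)route(load)); // prints 0 (Early), 2 (Late), 1 (OutOfOrder)
}
```

    The point of the split is that Early and Late instructions never allocate an issue-queue slot, which is what permits narrowing the out-of-order issue width.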

    Practical Data Value Speculation for Future High-end Processors

    A fait l'objet d'une publication à "High Performance Computer Architecture (HPCA) 2014" Lien : http://people.irisa.fr/Arthur.Perais/data/HPCA%2714_Practical_VP.pdf; Dedicating more silicon area to single thread performance will necessarily be considered as worthwhile in future - potentially heterogeneous - multicores. In particular, Value prediction (VP) was proposed in the mid 90's to enhance the performance of high-end uniprocessors by breaking true data dependencies. In this paper, we reconsider the concept of Value Prediction in the contemporary context and show its potential as a direction to improve current single thread performance. First, building on top of research carried out during the previous decade on confidence estimation, we show that every value predictor is amenable to very high prediction accuracy using very simple hardware. This clears the path to an implementation of VP without a complex selective reissue mechanism to absorb mispredictions, where prediction is performed in the in-order pipeline frond-end and validation is performed in the in-order pipeline back-end, while the out-of-order engine is only marginally modified. Second, when predicting back-to-back occurrences of the same instruction, previous context-based value predictors relying on local value history exhibit a complex critical loop that should ideally be implemented in a single cycle. To bypass this requirement, we introduce a new value predictor VTAGE harnessing the global branch history. VTAGE can seamlessly predict back-to-back occurrences, allowing predictions to span over several cycles. It achieves higher performance than previously proposed context-based predictors. Specifically, using SPEC'00 and SPEC'06 benchmarks, our simulations show that combining VTAGE and a Stride-based predictor yields up to 65% speedup on a fairly aggressive pipeline without support for selective reissue.; Dédier plus de surface de silicium à la performance séquentielle sera nécessairement considéré comme digne d'interÃÂȘt dans un futur proche. En particulier, la Prédiction de Valeurs (VP) a été proposée dans les années 90 afin d'améliorer la performance séquentielle des processeurs haute-performance en cassant les dépendances de données entre instructions. Dans ce papier, nous revisitons le concept de Prédiction de Valeurs dans un contexte contemporain et montrons son potentiel d'amélioration de la performance séquentielle. Spécifiquement, utilisant les suites de benchmarks SPEC'00 et SPEC'06, nos simulations montrent qu'en combinant notre prédicteur, VTAGE, avec un prédicteur de type Stride, des gains de performances allant jusqu'à 65% peuvent ÃÂȘtre observés sur un pipeline relativement agressif mais sans ré-exécution sélective en cas de mauvaise prédiction. Document type: External research repor

    Leveraging Targeted Value Prediction to Unlock New Hardware Strength Reduction Potential

    Value Prediction (VP) is a microarchitectural technique that speculatively breaks data dependencies to increase the available Instruction Level Parallelism (ILP) in general purpose processors. Despite recent proposals, VP remains expensive and has intricate interactions with several stages of the classical superscalar pipeline. In this paper, we revisit and simplify VP by leveraging the irregular distribution of the values produced during the execution of common programs. First, we demonstrate that a reasonable fraction of the performance uplift brought by a full VP infrastructure can be obtained by predicting only a few "usual suspects" values. Furthermore, we show that doing so greatly simplifies VP operation and reduces the value predictor footprint. Lastly, we show that these Minimal and Targeted VP infrastructures conceptually enable Speculative Strength Reduction (SpSR), a rename-time optimization whereby instructions can disappear at rename in the presence of specific operand values.
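    A minimal sketch of the "usual suspects" idea follows: each static instruction carries confidence counters only for a tiny fixed set of candidate values. The candidate set {0, 1} and the counter widths are illustrative assumptions, not the paper's chosen values.

```cpp
#include <cstdint>
#include <array>
#include <unordered_map>

// Illustrative candidate set; the paper's actual "usual suspects" may differ.
constexpr std::array<uint64_t, 2> kSuspects = {0, 1};

struct Counters { std::array<uint8_t, kSuspects.size()> conf{}; };

struct TargetedVP {
    std::unordered_map<uint64_t, Counters> table; // keyed by instruction PC

    // Predict a suspect value only once its counter is saturated.
    bool predict(uint64_t pc, uint64_t& out) const {
        auto it = table.find(pc);
        if (it == table.end()) return false;
        for (size_t i = 0; i < kSuspects.size(); ++i)
            if (it->second.conf[i] >= 7) { out = kSuspects[i]; return true; }
        return false;
    }

    // Train on the committed result: strengthen the matching suspect,
    // back off the others.
    void train(uint64_t pc, uint64_t actual) {
        Counters& c = table[pc];
        for (size_t i = 0; i < kSuspects.size(); ++i) {
            if (actual == kSuspects[i]) { if (c.conf[i] < 7) ++c.conf[i]; }
            else if (c.conf[i] > 0)     --c.conf[i];
        }
    }
};
```

    The predictor only ever stores small counters, never full 64-bit values, which is where the footprint reduction comes from.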

    Exploiting Value Prediction With Quasi-Unlimited Resources

    Recent trends regarding general purpose microprocessors have focused on Thread-Level Parallelism (TLP) and, in general, on parallel architectures such as multicores. However, due to Amdahl's law, the gain to be had from parallelizing a program is limited, since there will always be an incompressible sequential part whose execution time depends only on the sequential performance of the processor the program is executed on. Value Prediction was proposed in the late '90s as a way to improve sequential performance by predicting instruction results, allowing the hardware to break data dependencies between instructions and thus extract more Instruction Level Parallelism (ILP) from the code. In the meantime, very accurate geometric-length indirect branch target predictors such as ITTAGE were proposed. Indirect branch target prediction and value prediction exhibit some conceptual similarities, which is why we present a value predictor borrowing from both the indirect branch target predictor ITTAGE and existing work in the field of value prediction. As the transistor budget is not expected to be a problem for future microprocessors, we study the behavior of the Value TAGE (VTAGE) predictor at both finite and "infinite" sizes. We evaluate VTAGE performance on standard integer and floating-point workloads as well as on vectorized code.
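    For context, the "geometric length" shared by ITTAGE and VTAGE means the history length used to index component i grows as a geometric series; a common formulation (parameters illustrative, not taken from the paper) is:

```latex
L_i = \left\lceil L_1 \cdot \alpha^{\,i-1} \right\rceil, \qquad i = 1, \dots, N, \quad \alpha > 1
```

    For example, L_1 = 4 and alpha = 2 give lengths 4, 8, 16, 32: short histories capture recent, frequent correlation, while a few components still reach very long histories cheaply.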

    A Case for Speculative Strength Reduction

    Most high performance general purpose processors leverage register renaming to implement optimizations such as move elimination or zero-idiom elimination. Those optimizations can be seen as forms of strength reduction, whereby a faster but semantically equivalent operation is substituted for a slower operation. In this letter, we argue that other reductions can be performed dynamically if the input values of instructions are known in time, i.e., prior to renaming. We study the potential for leveraging Value Prediction to achieve that goal and show that in SPEC2k17, an average of 3.3% (up to 6.8%) of the dynamic instructions could be dynamically strength-reduced. Our experiments suggest that a state-of-the-art value predictor captures 59.7% of that potential on average (up to 99.6%).
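    A sketch of what a rename-time reduction check could look like is shown below. The opcode set and rules are illustrative assumptions; the letter's actual reduction tables may differ. Because operand values come from the value predictor, every reduction remains speculative and must be validated like any other prediction.

```cpp
#include <cstdint>
#include <optional>

enum class Op { Add, Mul, And, Or };                 // illustrative opcode set
enum class Reduced { None, MoveSrcA, MoveSrcB, ZeroIdiom };

// Rename-time speculative strength reduction: if a predicted operand value
// turns an instruction into an idiom the renamer already eliminates (moves,
// zero idioms), the instruction can disappear at rename. std::optional
// models "operand predicted with high confidence" vs. unknown.
Reduced reduce(Op op, std::optional<uint64_t> a, std::optional<uint64_t> b) {
    switch (op) {
        case Op::Add: // x + 0 -> move of the other source
        case Op::Or:  // x | 0 -> move of the other source
            if (a && *a == 0) return Reduced::MoveSrcB;
            if (b && *b == 0) return Reduced::MoveSrcA;
            break;
        case Op::Mul: // x * 0 -> zero idiom; x * 1 -> move of the other source
            if ((a && *a == 0) || (b && *b == 0)) return Reduced::ZeroIdiom;
            if (a && *a == 1) return Reduced::MoveSrcB;
            if (b && *b == 1) return Reduced::MoveSrcA;
            break;
        case Op::And: // x & 0 -> zero idiom
            if ((a && *a == 0) || (b && *b == 0)) return Reduced::ZeroIdiom;
            break;
    }
    return Reduced::None; // no known reduction: rename normally
}
```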

    High Performance General Purpose Architecture and Microarchitecture

    In this talk, we will provide a broad overview of why general purpose processors are here to stay and some research directions pursued by the Computer Architecture community. We will also briefly present recent work done at TIMA to improve the performance of general purpose processors.

    Value Prediction as a Means to Increase the Performance of Superscalar Processors

    Although currently available general purpose microprocessors feature more than 10 cores, many programs remain mostly sequential. This can be due to an inherent property of the algorithm used by the program, to the program being old and written during the uni-processor era, or simply to time-to-market constraints, as writing and validating parallel code is known to be hard. Moreover, even for parallel programs, the performance of the sequential part quickly becomes the limiting factor as more cores are made available to the application, as expressed by Amdahl's law. Consequently, increasing sequential performance remains a valid approach in the multicore era. Unfortunately, the conventional means to do so (increasing the out-of-order window size and issue width) are major contributors to the complexity and power consumption of the chip. In this thesis, we revisit a previously proposed technique that improves performance in an orthogonal fashion: Value Prediction (VP). Instead of increasing the execution engine aggressiveness, VP improves the utilization of existing resources by increasing the available Instruction Level Parallelism. In particular, we address the three main issues preventing VP from being implemented. First, we propose to remove validation and recovery from the execution engine and to perform them in-order at commit. Second, we propose a new execution model that executes some instructions in-order either before or after the out-of-order engine. This reduces pressure on said engine and allows its aggressiveness to be reduced; as a result, the port requirement on the physical register file and the overall complexity decrease. Third, we propose a prediction scheme that mimics the instruction fetch scheme: block-based prediction, sketched below. This allows several instructions to be predicted per cycle with a single read, hence a single port on the predictor array. These three propositions form a possible implementation of Value Prediction that is both realistic and efficient.
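    A minimal sketch of block-based prediction, the thesis's third proposition: the predictor mirrors instruction fetch by storing one entry per fetch block, so a single read returns predictions for every instruction in the block. Block size, slot count, and indexing below are illustrative assumptions.

```cpp
#include <cstdint>
#include <array>
#include <vector>

constexpr int kSlots       = 8;  // max instructions per fetch block (assumed)
constexpr int kBlockLog2   = 5;  // 32-byte aligned fetch blocks (assumed)
constexpr int kEntriesLog2 = 12; // number of predictor entries (assumed)

struct Slot { uint64_t value = 0; uint8_t confidence = 0; };
struct BlockEntry { std::array<Slot, kSlots> slots; };

struct BlockBasedPredictor {
    std::vector<BlockEntry> table{1u << kEntriesLog2};

    // One access per fetch block: a single read port on the predictor array
    // yields up to kSlots predictions, matching the fetch unit's bandwidth.
    const BlockEntry& lookup(uint64_t block_pc) const {
        return table[(block_pc >> kBlockLog2) & ((1u << kEntriesLog2) - 1)];
    }
};
```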